Analysis of Airbnb Data in NYC 2019

Introduction

This is a short data analysis of Airbnb listings in New York City (NYC) in 2019. The data was taken from https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/.

Package imports:

library(tidyverse)
library(knitr)

Read in dataset:

df <- read_csv("airbnb_nyc_2019.csv", 
               col_types = cols(host_id = col_character(), 
                                id = col_character(), 
                                last_review = col_date(format = "%Y-%m-%d")))

Getting a feel for the data

The dataset has 48895 rows and 16 columns. Here are the first 6 rows of the dataset:

kable(head(df))

id	name	host_id	host_name	neighbourhood_group	neighbourhood	latitude	longitude	room_type	price	minimum_nights	number_of_reviews	last_review	reviews_per_month	calculated_host_listings_count	availability_365
2539	Clean & quiet apt home by the park	2787	John	Brooklyn	Kensington	40.64749	-73.97237	Private room	149	1	9	2018-10-19	0.21	6	365
2595	Skylit Midtown Castle	2845	Jennifer	Manhattan	Midtown	40.75362	-73.98377	Entire home/apt	225	1	45	2019-05-21	0.38	2	355
3647	THE VILLAGE OF HARLEM….NEW YORK !	4632	Elisabeth	Manhattan	Harlem	40.80902	-73.94190	Private room	150	3	0	NA	NA	1	365
3831	Cozy Entire Floor of Brownstone	4869	LisaRoxanne	Brooklyn	Clinton Hill	40.68514	-73.95976	Entire home/apt	89	1	270	2019-07-05	4.64	1	194
5022	Entire Apt: Spacious Studio/Loft by central park	7192	Laura	Manhattan	East Harlem	40.79851	-73.94399	Entire home/apt	80	10	9	2018-11-19	0.10	1	0
5099	Large Cozy 1 BR Apartment In Midtown East	7322	Chris	Manhattan	Murray Hill	40.74767	-73.97500	Entire home/apt	200	3	74	2019-06-22	0.59	1	129

Here are the column names of the dataset:

names(df)

##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "latitude"                       "longitude"                     
##  [9] "room_type"                      "price"                         
## [11] "minimum_nights"                 "number_of_reviews"             
## [13] "last_review"                    "reviews_per_month"             
## [15] "calculated_host_listings_count" "availability_365"

Each row in this dataset is an Airbnb listing. Sanity check: listing IDs are unique in this dataset.

length(unique(df$id))

## [1] 48895
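
The check passes because the count of unique IDs equals the number of rows. A minimal sketch of the same check written as an explicit assertion is shown below. The data frame `toy` and its values are hypothetical stand-ins, not taken from the dataset.

```r
# Toy stand-in for df; the IDs here are made up for illustration.
toy <- data.frame(id = c("a1", "b2", "c3"), price = c(100, 80, 120))

# anyDuplicated() returns 0 when no value repeats, so stopifnot()
# raises an error if any listing ID appears more than once.
ids_unique <- anyDuplicated(toy$id) == 0
stopifnot(ids_unique)

# Equivalent formulation: compare the unique count with the row count.
length(unique(toy$id)) == nrow(toy)
```

Writing the check as an assertion means the report would stop at this point if a future version of the data contained duplicate IDs.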

We note that there are some NA values in the reviews_per_month column. This is probably because those listings have zero reviews. Let’s fill them in with zeros:

# if reviews_per_month is empty, it probably means zero reviews
df$reviews_per_month <- replace_na(df$reviews_per_month, 0)
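
The fill assumes that every missing reviews_per_month value belongs to a listing with no reviews. One way to test that assumption before filling is sketched below on a small made-up data frame. `toy` and `na_means_zero` are hypothetical names used for illustration.

```r
library(tidyr)  # for replace_na()

# Toy stand-in for df (values are made up for illustration).
toy <- data.frame(
    number_of_reviews = c(9, 0, 45, 0),
    reviews_per_month = c(0.21, NA, 0.38, NA)
)

# TRUE if every missing reviews_per_month belongs to a listing with no reviews.
na_means_zero <- all(toy$number_of_reviews[is.na(toy$reviews_per_month)] == 0)

# Only fill once the assumption holds.
stopifnot(na_means_zero)
toy$reviews_per_month <- replace_na(toy$reviews_per_month, 0)
```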

No. of listings by neighborhood

Manhattan has the most listings, followed by Brooklyn. Most of the listings are either “Entire home/apt” or “Private room”, with a fairly even split between these two types.

ggplot(df, aes(x = fct_infreq(neighbourhood_group), fill = room_type)) +
    geom_bar() +
    labs(title = "No. of listings by borough",
         x = "Borough", y = "No. of listings") +
    theme(legend.position = "bottom")
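
The split between room types is read off the bar chart. A table of counts and within-borough shares would put numbers on it. The sketch below shows one way to build such a table with dplyr, using a small made-up tibble. `toy` and `room_shares` are hypothetical names, and the values are not from the dataset.

```r
library(dplyr)

# Toy stand-in for df (values are made up for illustration).
toy <- tibble(
    neighbourhood_group = c("Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Queens"),
    room_type = c("Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room", "Shared room")
)

# Count listings per borough and room type, then convert the counts
# to shares within each borough.
room_shares <- toy %>%
    count(neighbourhood_group, room_type, name = "num_listings") %>%
    group_by(neighbourhood_group) %>%
    mutate(share = num_listings / sum(num_listings)) %>%
    ungroup()
room_shares
```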

Below is a plot of the top 10 neighborhoods by number of listings. All of them are in either Brooklyn or Manhattan.

df %>%
    group_by(neighbourhood) %>%
    summarize(num_listings = n(), 
              borough = unique(neighbourhood_group)) %>%
    top_n(n = 10, wt = num_listings) %>%
    ggplot(aes(x = fct_reorder(neighbourhood, num_listings), 
               y = num_listings, fill = borough)) +
    geom_col() +
    coord_flip() +
    theme(legend.position = "bottom") +
    labs(title = "Top 10 neighborhoods by no. of listings",
         x = "Neighborhood", y = "No. of listings")

Price by room type

The plot below shows the distribution of price by room type. (Note that the y-axis is on a log scale.) There is much variation in price within each room type. Overall, it looks like “Entire home/apt” listings are slightly pricier than “Private room”, which in turn are more expensive than “Shared room”. This makes intuitive sense.

ggplot(df, aes(x = room_type, y = price)) +
    geom_violin() +
    scale_y_log10()

In making this plot, we noticed that 11 listings had a price of zero. We are not sure why this is the case, but since they make up such a small fraction of the listings, we will ignore them for this analysis.

df %>% filter(price == 0) %>%
    select(name, host_id, host_name, neighbourhood_group, room_type, minimum_nights)

## # A tibble: 11 x 6
##    name        host_id  host_name neighbourhood_g… room_type minimum_nights
##    <chr>       <chr>    <chr>     <chr>            <chr>              <dbl>
##  1 Huge Brook… 8993084  Kimberly  Brooklyn         Private …              4
##  2 ★Hostel St… 1316975… Anisha    Bronx            Private …              2
##  3 MARTIAL LO… 15787004 Martial … Brooklyn         Private …              2
##  4 Sunny, Qui… 1641537  Lauren    Brooklyn         Private …              2
##  5 Modern apa… 10132166 Aymeric   Brooklyn         Entire h…              5
##  6 Spacious c… 86327101 Adeyemi   Brooklyn         Private …              1
##  7 Contempora… 86327101 Adeyemi   Brooklyn         Private …              1
##  8 Cozy yet s… 86327101 Adeyemi   Brooklyn         Private …              1
##  9 the best y… 13709292 Qiuchi    Manhattan        Entire h…              3
## 10 Coliving i… 1019705… Sergii    Brooklyn         Shared r…             30
## 11 Best Coliv… 1019705… Sergii    Brooklyn         Shared r…             30
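
Zero prices surface in a log-scale plot because log10(0) is -Inf, which has no position on the axis. ggplot2 typically drops such values with a warning. The sketch below, on made-up prices, shows the transformation and one way to make the exclusion explicit. `toy` and `toy_pos` are hypothetical names.

```r
# Toy prices (made up); one listing has a price of zero.
toy <- data.frame(room_type = c("Private room", "Private room", "Shared room"),
                  price = c(0, 150, 40))

# log10(0) is -Inf, which cannot be placed on a log-scaled axis.
log10(toy$price)

# Keep only positive prices so the exclusion is explicit.
toy_pos <- toy[toy$price > 0, ]

# Number of listings excluded by the filter.
nrow(toy) - nrow(toy_pos)
```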

Relationship between number of listings and median price by neighborhood

Does the number of listings in a neighborhood affect the prices of those listings? For each neighborhood, we look at the number of listings as well as its median price. In the plot below, each neighborhood is represented by one point, and its color represents the borough it belongs to.

# compute summary statistics for each neighborhood
nhd_df <- df %>%
    group_by(neighbourhood) %>%
    summarize(num_listings = n(),
              median_price = median(price),
              long = median(longitude),
              lat = median(latitude),
              borough = unique(neighbourhood_group))

nhd_df %>%
    ggplot(aes(x = num_listings, y = median_price, col = borough)) +
    geom_point(alpha = 0.5) + geom_smooth(se = FALSE) +
    scale_x_log10() + scale_y_log10() +
    theme_minimal() +
    theme(legend.position = "bottom")

Within each borough, it looks like the number of listings in a neighborhood does not have much of an impact on the median price of its listings.
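
This impression comes from looking at the plot. A rank correlation per borough would be one way to quantify it. The sketch below is not part of the original analysis. It computes Spearman correlations on a made-up stand-in for nhd_df (`toy_nhd` and `rank_cor` are hypothetical names). Values near zero would be consistent with the visual impression.

```r
library(dplyr)

# Toy stand-in for nhd_df (values are made up for illustration).
toy_nhd <- tibble(
    borough      = rep(c("Brooklyn", "Manhattan"), each = 4),
    num_listings = c(10, 50, 200, 900, 20, 80, 400, 1500),
    median_price = c(70, 95, 80, 90, 150, 120, 160, 140)
)

# Spearman correlation between listing count and median price,
# computed separately for each borough.
rank_cor <- toy_nhd %>%
    group_by(borough) %>%
    summarize(spearman = cor(num_listings, median_price, method = "spearman"))
rank_cor
```

Spearman is used here rather than Pearson because both variables are heavily skewed, which is also why the plot uses log scales on both axes.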

Map of the top 50 most expensive listings

library(ggmap)

# get top 50 listings by price
# (top_n keeps ties, so this may return more than 50 rows)
top_df <- df %>% top_n(n = 50, wt = price)

# get background map
top_height <- max(top_df$latitude) - min(top_df$latitude)
top_width <- max(top_df$longitude) - min(top_df$longitude)
top_borders <- c(bottom  = min(top_df$latitude)  - 0.1 * top_height,
                 top     = max(top_df$latitude)  + 0.1 * top_height,
                 left    = min(top_df$longitude) - 0.1 * top_width,
                 right   = max(top_df$longitude) + 0.1 * top_width)

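# note: newer ggmap releases serve these tiles through get_stadiamap(),
# which needs a registered Stadia Maps API key
top_map <- get_stamenmap(top_borders, zoom = 12, maptype = "toner-lite")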

# map of top 50 most expensive
ggmap(top_map) +
    geom_point(data = top_df, mapping = aes(x = longitude, y = latitude,
                                        col = price)) +
    scale_color_gradient(low = "blue", high = "red")

Most of them are located in Manhattan.
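
The background map is fetched for a bounding box that extends 10% beyond the points on each side. That calculation can be wrapped in a small helper, sketched below. The function name `padded_bbox` and its `pad` argument are hypothetical, and the coordinates are made up.

```r
# Hypothetical helper: bounding box around a set of points, padded by a
# fraction of the box's width and height on each side.
padded_bbox <- function(lon, lat, pad = 0.1) {
    width  <- max(lon) - min(lon)
    height <- max(lat) - min(lat)
    c(bottom = min(lat) - pad * height,
      top    = max(lat) + pad * height,
      left   = min(lon) - pad * width,
      right  = max(lon) + pad * width)
}

# Made-up coordinates roughly in the NYC area.
bbox <- padded_bbox(lon = c(-74.0, -73.9), lat = c(40.6, 40.8))
bbox
```

A call such as `padded_bbox(top_df$longitude, top_df$latitude)` would then give the same named vector as the manual calculation above.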

Median price by neighborhood

In the map below, each dot is one neighborhood. The size of the dot depends on the number of listings and the color of the dot depends on the median price in that neighborhood.

# map of all listings: one point per neighborhood
height <- max(df$latitude) - min(df$latitude)
width <- max(df$longitude) - min(df$longitude)
borders <- c(bottom  = min(df$latitude)  - 0.1 * height,
             top     = max(df$latitude)  + 0.1 * height,
             left    = min(df$longitude) - 0.1 * width,
             right   = max(df$longitude) + 0.1 * width)

map <- get_stamenmap(borders, zoom = 11, maptype = "toner-lite")
ggmap(map) +
    geom_point(data = nhd_df, mapping = aes(x = long, y = lat,
                                            col = median_price, size = num_listings)) +
    scale_color_gradient(low = "blue", high = "red")

The median price for most neighborhoods is quite low; it looks somewhat elevated in Manhattan. Also, there are one or two neighborhoods with very high median prices in Staten Island: this is worth investigating further.
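
One possible starting point for that investigation is to list the Staten Island neighborhoods by median price alongside their listing counts. A median based on only one or two listings can be set by a single expensive listing. The sketch below uses a made-up stand-in for nhd_df (`toy_nhd` and `si_top` are hypothetical names). It does not establish that this is what happens in the real data.

```r
library(dplyr)

# Toy stand-in for nhd_df (values are made up for illustration).
toy_nhd <- tibble(
    neighbourhood = c("A", "B", "C", "D"),
    borough       = c("Staten Island", "Staten Island", "Staten Island", "Manhattan"),
    num_listings  = c(1, 2, 40, 900),
    median_price  = c(800, 700, 75, 150)
)

# Staten Island neighborhoods, most expensive first, with listing counts
# shown so that small-sample medians are easy to spot.
si_top <- toy_nhd %>%
    filter(borough == "Staten Island") %>%
    arrange(desc(median_price)) %>%
    select(neighbourhood, num_listings, median_price)
si_top
```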

Conclusion

There is much in this dataset that we have not explored yet. At first glance, it appears that room type and neighborhood have an effect on the listing price, while the number of listings in the neighborhood does not.